Run if you don’t have the packages installed.
# a helper abbreviation
`%not in%` <- Negate(`%in%`)
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
needed_packages <-
c("recount3", "maftools", "DESeq2", "TCGAbiolinks", "biomaRt")
for (pkg in needed_packages) {
if (pkg %not in% rownames(installed.packages())) {
print(paste("Trying to install", pkg))
BiocManager::install(pkg)
if ((pkg %not in% rownames(installed.packages()))) {
msg <- paste("ERROR: Unsuccessful!", pkg, "not installed!",
"Check the log and try installing the package manually.")
stop(msg)
}
}
library(pkg, character.only = TRUE)
ifelse(pkg %in% loadedNamespaces(),
print(paste("Successful!", pkg, "loaded.")),
print(paste("ERROR: Unsuccessful!", pkg,
"not loaded. Check error msg.")))
}
pkg <- "InteractiveComplexHeatmap"
if (pkg %not in% installed.packages()) {
print(paste("Trying to install", pkg))
remotes::install_github("jokergoo/InteractiveComplexHeatmap",
upgrade = "always")
library(pkg, character.only = TRUE)
ifelse(pkg %in% loadedNamespaces(),
print(paste("Successful!", pkg, "loaded.")),
print(paste("ERROR: Unsuccessful!", pkg,
"not loaded. Check error msg.")))
}
Load all the packages and define functions.
# for unified and processed RNA-seq data
library(recount3)
# to normalize the RNA-seq data
library(DESeq2)
# for access to TCGA data
library(TCGAbiolinks)
# to look at the data
library(tidyverse)
# to visualize the mutation data
library(maftools)
# to create heatmaps
library(ComplexHeatmap)
scale2 <- function(mat, ...) {
t(scale(t(mat), ...))
}
Using recount3 we download the data for a Lower Grade Glioma (LGG). In order to read more about the package and explore more of its function, refer to the manual and quick guide.
rse_gene <-
create_rse(
subset(
available_projects(),
project == "LGG" & project_type == "data_sources"
)
)
Now, let’s explore what are we looking at?
assayNames(rse_gene)
## [1] "raw_counts"
We need to scale the reads to be able to use them in DESeq2 processing.
assay(rse_gene, "counts") <-
transform_counts(rse_gene)
The attached colData contains a lot of information that we can use on top of the expression data.
sample_sheet <-
colData(rse_gene) %>%
data.frame() %>%
rownames_to_column("sample_id")
sample_sheet %>%
head(n = 3)
For plotting, we will use the variance stabilizing transformation (vst) normalized counts (as it is quicker than rlog). To read more about normalization, please read Data transformations and visualization section of RNA-seq data anlysis guide from Bioconductor.
normalized_counts <-
vst(assays(rse_gene)$counts)
## converting counts to integer mode
normalized_counts[1:5, 1:2]
## f56466dc-3c0f-48be-8a72-cedeaef52f00
## ENSG00000278704.1 2.648876
## ENSG00000277400.1 2.648876
## ENSG00000274847.1 2.648876
## ENSG00000277428.1 2.648876
## ENSG00000276256.1 2.648876
## f35f85a8-b3cd-4116-990f-94b2e432f902
## ENSG00000278704.1 2.648876
## ENSG00000277400.1 2.648876
## ENSG00000274847.1 2.648876
## ENSG00000277428.1 2.648876
## ENSG00000276256.1 2.648876
Now, to simplify our analyses we want to have a) only Tumor Samples (unless you decide otherwise) and b) one sample per patient in the data set.
sample_sheet_one <-
sample_sheet %>%
filter(tcga.cgc_sample_sample_type == "Primary Tumor") %>%
mutate(patient_id = str_extract(tcga.tcga_barcode,
"[^-]{4}-[^-]{2}-[^-]{4}")) %>%
group_by(patient_id) %>%
# this is quick, but dirty way to take just one repeat per patient
sample_n(1)
# select only the samples we want to keep
normalized_counts <-
normalized_counts[ , sample_sheet_one$sample_id]
# change rownames to nice patient ids
colnames(normalized_counts) <- sample_sheet_one$patient_id
normalized_counts[156:159, 1:2]
## TCGA-CS-4938 TCGA-CS-4941
## ENSG00000238203.8 2.648876 2.648876
## ENSG00000264393.1 2.648876 2.648876
## ENSG00000231116.9 2.648876 2.648876
## ENSG00000224979.6 2.648876 2.648876
Let’s calculate the variance between the genes, it will come in handy when we would like to narrow down the analyses to the most variable subset of genes.
row_var <-
rowVars(normalized_counts)
Let’s look at the sample distances - can we see some clustering?
First, let’s calculate the distances between our samples. For the sake of speed we will only take top 35% of genes.
samples_dist <- dist(t(normalized_counts[row_var > quantile(row_var, 0.75),]))
And now, let’s see the sample distances. For that we will use heatmap. A heatmap is a visualization of a matrix in which each cell is colored according to its value.
Heatmap(as.matrix(samples_dist),
show_row_names = FALSE,
show_column_names = FALSE,
col = viridis::mako(100),
show_row_dend = FALSE)
We can see that there are some samples with high and low similarity, meaning that we can see some organization in our data.
Now, let’s generate the heatmap showing expression of different genes. For that purpose, we select top 1% of highly variable genes.
In here, the rows are genes and the columns are samples, meaning that each cell represents an expression value for a given gene in a given sample. In this heatmap, we visualized scaled values - meaning we subtracted the mean and dived by standard deviation. Scaling allows us to focus on the changes in expression between samples, regardless of absolute levels of expression of particular genes.
ht <-
Heatmap(scale2(normalized_counts[row_var > quantile(row_var, 0.99),]),
show_row_names = FALSE, show_column_names = FALSE,
clustering_distance_rows = "pearson", name = "gene expression",
col = viridis::viridis(100), use_raster = TRUE)
ht
## Loading required namespace: Cairo
Those heatmap need some work - we need to add more information coming from additional data sources. But first, let see what type of data we have access to.
We will now use the great TCGAbiolinks package that help accessing the vast data source of TCGA. Read more about this great package here.
Because of the unexpected maintenance of the GDC server we will have to use other means to download the mutations data. For that we will use the maftools::tcgaLoad function. Note that that function comes with the v2.8 version, the code below should tell you if you need to install the package.
Note2: This maf from maftools::tcgaLoad is generated based on the hg19 genome, while the TCGAbiolinks::GDCquery_Maf you could choose hg38 or hg19 (on default it downloaded the hg38).
maf <-
GDCquery_Maf("LGG", pipelines = "mutect2") %>%
read.maf(verbose = TRUE)
# now that GDC works again, we don't need this
#
# tryCatch(maf <- tcgaLoad(study = "LGG"),
# error = function(e) {
# print(paste(rep("#", 50), collapse = ""))
# print(paste0("# ERROR! Read the message below!",
# paste(rep(" ", 17), collapse = ""),
# "#"))
# print(paste(rep("#", 50), collapse = ""))
# print(e)
# print(paste("If you're seeing this message you probably don't have",
# "maftools package loaded, or have an older version.",
# "This function is available with v2.8.",
# "Install the new version of maftools package with",
# "`BiocManager::install('PoisonAlien/maftools')`",
# "and try again!"))
# })
First thing we can do is the mutation specific summary - what kind of mutations do we have in a sample? For that we will use functions from another great package - maftools a dedicated package to visualize maf data. You can see what brilliant things the package does and what things you can easily investigate here.
plotmafSummary(maf = maf, rmOutlier = TRUE,
addStat = 'median', dashboard = TRUE,
log_scale = TRUE)
First thing we see is that this cancer type does not have many mutations. We can compare it with other TCGA cancer types.
tcga_mutation_burden <-
tcgaCompare(maf = maf, cohortName = "LGG")
## Warning in FUN(X[[i]], ...): Removed 1 samples with zero mutations.
## Performing pairwise t-test for differences in mutation burden..
We can see that LGG is one of the less mutated cancers.
Now, we can look at the top mutated genes.
oncoplot(maf = maf, top = 10)
IDH1 is by far the most mutated gene - are mutations in that gene in one hotspot or distributed? Let’s visualize the mutations in this gene.
lollipopPlot(maf, "IDH1", labelPos = 'all')
## Assuming protein change information are stored under column HGVSp_Short. Use argument AACol to override if necessary.
Clearly the R132 position is a hotspot, mutated in more than 3/4 of the samples!
The case is not so clear with TP53.
lollipopPlot(maf, "TP53")
## Assuming protein change information are stored under column HGVSp_Short. Use argument AACol to override if necessary.
## 8 transcripts available. Use arguments refSeqID or proteinID to manually specify tx name.
## HGNC refseq.ID protein.ID aa.length
## 1: TP53 NM_000546 NP_000537 393
## 2: TP53 NM_001126112 NP_001119584 393
## 3: TP53 NM_001126118 NP_001119590 354
## 4: TP53 NM_001126115 NP_001119587 261
## 5: TP53 NM_001126113 NP_001119585 346
## 6: TP53 NM_001126117 NP_001119589 214
## 7: TP53 NM_001126114 NP_001119586 341
## 8: TP53 NM_001126116 NP_001119588 209
## Using longer transcript NM_000546 for now.
If we would annotate all the mutations on the TP53 lolliplot we wouldn’t be able to read anything, so we need to subset the positions.
top_label <-
maf@data %>%
filter(Hugo_Symbol == "TP53") %>%
group_by(HGVSp_Short) %>%
summarise(count = n()) %>%
top_n(5) %>%
pull(HGVSp_Short) %>%
str_extract("[0-9]+")
## Selecting by count
lollipopPlot(maf, "TP53", labelPos = top_label, labPosAngle = 20)
## Assuming protein change information are stored under column HGVSp_Short. Use argument AACol to override if necessary.
## 8 transcripts available. Use arguments refSeqID or proteinID to manually specify tx name.
## HGNC refseq.ID protein.ID aa.length
## 1: TP53 NM_000546 NP_000537 393
## 2: TP53 NM_001126112 NP_001119584 393
## 3: TP53 NM_001126118 NP_001119590 354
## 4: TP53 NM_001126115 NP_001119587 261
## 5: TP53 NM_001126113 NP_001119585 346
## 6: TP53 NM_001126117 NP_001119589 214
## 7: TP53 NM_001126114 NP_001119586 341
## 8: TP53 NM_001126116 NP_001119588 209
## Using longer transcript NM_000546 for now.
Similarly to TP53, ATRX is mutated at different positions rather than in one hotspot. Interestingly, this gene is enriched for Frameshift and Nonsense mutations.
lollipopPlot(maf, "ATRX")
## Assuming protein change information are stored under column HGVSp_Short. Use argument AACol to override if necessary.
## 2 transcripts available. Use arguments refSeqID or proteinID to manually specify tx name.
## HGNC refseq.ID protein.ID aa.length
## 1: ATRX NM_138270 NP_612114 2454
## 2: ATRX NM_000489 NP_000480 2492
## Using longer transcript NM_000489 for now.
We can also look at the cooccurences and mutual exclusivity - which genes are frequently mutated together? Which aren’t?
somaticInteractions(maf, top = 15, pvalue = c(0.01, 0.05))
With the recount3 data comes clinical information as well. For example, we can check the sex or ethnicity of the patients with tcga.gdc_cases.demographic. columns.
sample_sheet %>%
select(starts_with("tcga.gdc_cases.demographic."))
There’s much more data in that dataframe - after all there are almost a thousand columns. However that’s only the data for samples in the expression matrix.
We can also access additional data from TCGAbiolinks, for example we can see in what subtypes were the samples separated into. This data will cover more patients - including those which samples were whole exome sequenced, but have no expression data and so on.
tcga_subtype_data <-
TCGAquery_subtype(tumor = "lgg")
## lgg subtype information from:doi:10.1016/j.cell.2015.12.028
tcga_subtype_data %>%
select(ends_with("subtype"))
The dataframe contains much more data, and the function prints out information about the publication the data comes from.
tcga_subtype_data
Now that we have information of a subtype (and other), we can add it to our heatmap.
top_annotation_df <-
tcga_subtype_data %>%
select(patient, Histology, Original.Subtype, Transcriptome.Subtype) %>%
column_to_rownames("patient")
# make sure we are taking only information about samples in RNA-seq expression
top_annotation_df <-
top_annotation_df[colnames(normalized_counts),]
# prepare colors, cause random ones are bad
# I generate those pallets https://colorbrewer2.org/
histology_cols <- c(astrocytoma = "#d8b365",
oligodendroglioma ="#99d8c9",
oligoastrocytoma = "#2ca25f")
original_cols <- c("IDHmut-non-codel" = "#fee8c8",
"IDHwt" = "#fdbb84",
"IDHmut-codel" = "#e34a33")
transcriptome_cols <- c("CL" = "#7b3294",
"ME" = "#c2a5cf",
"NE" = "#a6dba0",
"PN" = "#008837")
# create annotation
ha <-
HeatmapAnnotation(df = top_annotation_df,
which = "column",
col = list("Histology" = histology_cols,
"Original.Subtype" = original_cols,
"Transcriptome.Subtype" = transcriptome_cols),
na_col = "#999999")
Heatmap(as.matrix(samples_dist),
show_row_names = FALSE,
show_column_names = FALSE,
col = viridis::mako(100),
show_row_dend = FALSE,
top_annotation = ha)
And let’s see the transciptome heatmap with added annotation. Interestingly, even by selecting only top 1% we can see that different Transcriptome Subtypes are clustering together.
ht <-
Heatmap(scale2(normalized_counts[row_var > quantile(row_var, 0.99),]),
show_row_names = FALSE, show_column_names = FALSE,
clustering_distance_rows = "pearson", name = "gene expression",
col = viridis::viridis(100), use_raster = TRUE, top_annotation = ha)
ht
You can do much more with those data - firstly, you can draw inspiration from the vignettes of the packages. Secondly, you can see what is interesting to you, what you would want to know? Thirdly, this is already published data and you can always refer to the paper - maybe there are some figures you have an idea to improve?
Few ideas from the top of my head: * improve the heatmap by adding the subtype, sex and other variable annotations. Add the mutation in various top genes as annotations at the bottom; * reduce dimensionality of expression data with PCA or tSNE and see if you see the subtypes specific from transcriptome data * access the survival data and see how they separate based on the subtype; * visualize point mutations on the protein (for example here)
For example, in order to add IDH1 annotation, we need to join the information from two dataframes (one that has unique column ids and one that has the information about mutation).
idh1_mutation <-
maf@data %>%
mutate(patient_id = str_extract(Tumor_Sample_Barcode,
"[^-]{4}-[^-]{2}-[^-]{4}")) %>%
select(Hugo_Symbol, patient_id) %>%
filter(Hugo_Symbol == "IDH1") %>%
column_to_rownames("patient_id")
ha_bottom <-
idh1_mutation[colnames(normalized_counts), , drop = FALSE] %>%
rownames_to_column("patient_id") %>%
mutate(present = ifelse(is.na(Hugo_Symbol), FALSE, TRUE)) %>%
group_by(patient_id) %>%
summarise(IDH1 = any(present)) %>%
column_to_rownames("patient_id") %>%
HeatmapAnnotation(df = ., col = list("IDH1" = c(`TRUE` = "black",
`FALSE` = "white")))
When we have the annotation, we can add it to the plot.
ht3 <-
Heatmap(scale2(normalized_counts[row_var > quantile(row_var, 0.99),]),
show_row_names = FALSE, show_column_names = FALSE,
clustering_distance_rows = "pearson", name = "gene expression",
col = viridis::viridis(100),
top_annotation = ha, bottom_annotation = ha_bottom)
ht3
With using just one package and one command, we can make an interactive plot.
But before this, I want to translate the ensembl ids to hugo gene symbol to improve readability.
ensembl_ids <-
rownames(normalized_counts) %>%
str_remove("\\.[0-9]*")
mart <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
ensembl_to_hgnc <-
biomaRt::getBM(attributes = c('ensembl_gene_id', 'hgnc_symbol'),
mart = mart) %>%
filter(ensembl_gene_id %in% ensembl_ids) %>%
group_by(ensembl_gene_id) %>%
summarise(hgnc_symbol = paste(unique(hgnc_symbol)[unique(hgnc_symbol) != ""],
collapse = ",")) %>%
mutate(hgnc_symbol = ifelse(hgnc_symbol == "",
ensembl_gene_id,
hgnc_symbol)) %>%
column_to_rownames("ensembl_gene_id")
rownames(normalized_counts) <-
ensembl_to_hgnc[ensembl_ids,]
Now that we fixed the gene names, let’s look at the interactive heatmap using InteractiveComplexHeatmap package. Run below code in your notebook.
# to create interactive heatmaps
library(InteractiveComplexHeatmap)
ht_interactive <-
Heatmap(scale2(normalized_counts[row_var > quantile(row_var, 0.99),]),
show_row_names = TRUE, show_column_names = FALSE,
clustering_distance_rows = "pearson", name = "gene expression",
col = viridis::viridis(100),
top_annotation = ha, bottom_annotation = ha_bottom)
htShiny(ht_interactive)
## Loading required package: shiny
##
## Listening on http://127.0.0.1:7767
For more information and beautiful examples check out: Documentation and GitHub.
# which genes are we intersted in?
genes_of_interest <- c("IDH1", "TP53")
# this will allow us to distinguish between no infomration about mutation
# and no mutation
patient_muts <-
maf@data %>%
mutate(patient_id = str_extract(Tumor_Sample_Barcode,
"[^-]{4}-[^-]{2}-[^-]{4}")) %>%
pull(patient_id) %>%
unique()
expression_vals <-
normalized_counts[genes_of_interest,] %>%
as.data.frame() %>%
rownames_to_column("Hugo_Symbol") %>%
pivot_longer(names_to = "patient_id",
values_to = "norm_expression",
cols = -Hugo_Symbol)
mutation_data <-
maf@data %>%
filter(Hugo_Symbol %in% genes_of_interest) %>%
mutate(patient_id = str_extract(Tumor_Sample_Barcode,
"[^-]{4}-[^-]{2}-[^-]{4}")) %>%
dplyr::select(patient_id, Hugo_Symbol, VARIANT_CLASS)
expr_mut_df <-
full_join(expression_vals, mutation_data) %>%
# introduce WT for samples with no mutation in the gene
mutate(VARIANT_CLASS = as.character(VARIANT_CLASS),
VARIANT_CLASS = ifelse(is.na(VARIANT_CLASS),
yes = ifelse(patient_id %in% patient_muts,
yes = "WT",
no = "NA"),
no = VARIANT_CLASS),
VARIANT_CLASS = factor(VARIANT_CLASS,
levels = c("WT",
"SNV",
"deletion", "insertion",
"NA")))
## Joining, by = c("Hugo_Symbol", "patient_id")
expr_mut_df %>%
ggplot(aes(VARIANT_CLASS, norm_expression, color = VARIANT_CLASS,
shape = VARIANT_CLASS)) +
geom_jitter() +
geom_boxplot(color = "black", width = 0.3,
alpha = 0.7, outlier.shape = NA) +
facet_wrap(~Hugo_Symbol) +
theme_classic() +
theme(legend.position = "none") +
labs(y = "vst normalized expression",
x = "Variant classification")
## Warning: Removed 3 rows containing non-finite values (stat_boxplot).
## Warning: Removed 3 rows containing missing values (geom_point).
For your project your goal is to describe the selected TCGA cohort:
Create a blogpost with description of your analysis, document your progress. Try to add interactivity to your visualizations.
You will have to present the data to the rest of the hackathon groups at the end of this event. Prepare 10 minutes of showcasing the data, your visualizations.
This is the main project of this hackathon. From now on you will be working with your teams on your own and I will be available to guide you, answer your questions. You are expected to spend few hours each day working on this.
Good luck! I’m already excited to see your data viz!